ABSTRACT

The National Center for Research on Evaluation, 
Standards, and Student Testing (CRESST) is a partnership of the 
University of California at Los Angeles, the University of Colorado 
at Boulder, Stanford University, The RAND Corporation, the University 
of Pittsburgh, the Educational Testing Service, and the University of 
California, Santa Barbara. This issue of "Evaluation Comment" shares 
the goals and perspectives that will shape CRESST's research program 
for the next 5 years. With a focus on the assessment of education 
quality, CRESST expects to study: (1) assessment that leads to 
improvement in teaching and learning; (2) understanding and 
influencing assessment policy and large-scale practice; (3) improved 
technical knowledge about the quality of assessment; and (4) 
dissemination and outreach that successfully decreases the interval 
between research and practice. The conceptual model that will 
underlie the research program emphasizes societal impact as the 
ultimate goal and identifies four major domains: validity, fairness, 
credibility, and utility. This model will guide an ambitious agenda 
of research focusing on the areas of system coherence, adaptations 
and accommodations of assessments, the measurement of progress, and 
reporting. The issue also discusses the CRESST conference scheduled 
for September 1996 and 1996 CRESST resource papers and technical 
reports. (Contains 1 figure and 137 references.) (SLD) 
The newly awarded National Center for Research on Evaluation,
Standards, and Student Testing (CRESST) is a partnership of UCLA, the
University of Colorado at Boulder, Stanford University, The RAND
Corporation, the University of Pittsburgh, the Educational Testing
Service, and the University of California, Santa Barbara. In this
Comment, we share the goals and perspectives that will shape our
research program for the next five years. With the assessment of
education quality our focus, we commit ourselves to four key programs
of work: assessment that leads to improvements in teaching and
learning; understanding and influencing assessment policy and
large-scale practice; improved technical knowledge about the quality
of assessment; and dissemination and outreach that successfully
decreases the interval between research and practice. Our programs
are driven by our desire to meet the immediate and future needs of
education policy and practice, yet reflect the historical lessons and
current assessment trends across America.

Trends in Assessment Policy 
Throughout this century educational testing has 
been called upon to serve many different purposes. It 
has been used to allocate scarce resources through 
student selection, to place children in educational 
programs, to monitor student achievement, and to 
hold educators accountable for student performance. 
Reformers have used test results to document deficiencies in order
to help build the case that change was needed. They have also relied
on testing as a major instrument of reform (see, for example, U.S.
Congress, Office of Technology Assessment, 1992). Not surprisingly,
testing has also been at the center of frequent and sometimes
intense controversy (Cronbach, 1975).
The 1996 CRESST Conference



Please mark your calendars to attend the 1996 CRESST Conference at UCLA’s Sunset Village,
September 5-6, 1996. CRESST partners and other distinguished colleagues will present
findings from recent K-12 assessment research and discuss issues for upcoming research projects.
A tentative schedule of presenters and sessions is provided below and on pages 23 and 24.

The on-site $275 registration fee includes all meals and housing at the conference center for two days, a 
reception and formal dinner. Extra nights including meals are available at $95. The commuter fee for those 
not requiring housing is $125 and includes several meals, parking, and a reception. 

We must receive your registration form by August 19 and total payment by September 3. Space is limited 
to the first 300 registrants. Sorry, but absolutely no partial plans are available, and no refunds or changes may 
be made after September 3. 

Look for additional conference details in the summer CRESST Line issue and at the CRESST Web site, 
http://www.cse.ucla.edu. Or call CRESST at 310-206-1532. 



The 1996 CRESST Conference Agenda (Tentative) 

Note: Presenters, sessions, and titles are subject to change.

Thursday, September 5, 1996 



8:45-10:30 a.m. - The CRESST Assessment Model: Consolidating What We
Know and Where We Need to Go

• Overview of Key Validity Issues - Robert Linn,
CRESST/University of Colorado at Boulder

• Overview of Key Equity Issues - Edmund T. Gordon,
CRESST/Yale University (Emeritus)

• Creating Credible Assessments for the Public - Richard Colvin,
Los Angeles Times

• The Politics of Credibility - Lorraine McDonnell,
CRESST/University of California, Santa Barbara

10:45-noon - Validity and Utility of Assessment Systems

• Standards for Assessment Systems - Joe Conaty,
Office of Educational Research and Improvement (invited)

• Large-Scale Systems Serving Multiple Purposes: The Title I
Standards and Assessment Challenge - Eva L. Baker, CRESST/UCLA

• One State’s Response to the Challenge: The Washington State
Example - Judy Billings, Washington State Dept. of Public Instruction

• The Face of Cultural Diversity in Assessment System Design -
Roland Tharp, University of California, Santa Cruz

1:15-2:45 p.m. - The CRESST Road Map: Priority R&D Issues in
Reaching Our Destination

• Framing the Future of Assessment Systems - Joan Herman,
CRESST/UCLA

• From Standards to Assessments - Thomas Romberg,
University of Wisconsin, Madison

• Measuring Student Progress - Bengt Muthen, CRESST/UCLA

• System Consequences for At-Risk Students - George Madaus,
Boston College

3:00-4:00 p.m. - Special Sessions From Centers’ Recent Research

• Alignment of Content Standards and Assessment Measures in
Mathematics and Science - Norman Webb, Wisconsin Center for
Education Research

• School and Classroom Interventions for At-Risk Students -
Sylvia Johnson, Center for Research on the Education of Students
Placed at Risk

• An Overview of Research From the National Assessment of
Educational Progress - Jamal Abedi, CRESST/UCLA; George Bohrnstedt,
American Institutes for Research

• Assessing Problem Solving in Science - Noreen Webb, CRESST/UCLA;
Gail Baxter, CRESST/University of Michigan (invited)

(continued on page 23) 



1996 CRESST CONFERENCE REGISTRATION FORM 

Thursday, September 5 - Friday, September 6, 1996

ALL REGISTRANTS: Complete this form and mail with payment to: CRESST/UCLA, 10920 Wilshire
Blvd., #900, Los Angeles, CA 90024-6511, Attn: Kathryn Morrison. Call or fax registration information
immediately to ensure your space at the conference. Phone: (310) 206-1532. Fax: (310) 825-3883. We strongly
suggest you make a copy of this form for your records. Reservations are due by August 19, 1996.

Name (print)

Title

Organization (for name badge)

Address

City                    State                    Zip

Phone                   Fax                      E-mail


ALL CONFERENCE ATTENDEES, INCLUDING PRESENTERS, MUST SPECIFY BELOW 
THE NIGHTS THEY REQUIRE A ROOM. 

I will need a room for the following nights: 

□ Tues 9/3 □ Wed 9/4 □ Thur 9/5 □ Fri 9/6 □ Sat 9/7 □ None (Off-Site Registrant)



PRESENTERS ONLY 



Presenters’ airline reservations MUST be made through American Express 
Travel at (800) 235-8252. Travel forms must be submitted after the 
conference for reimbursable expenses. Specify housing above. 

Registration, meals, and room fees are waived for presenters.

Date and Time of Presentation: 

Audio-Visual Needs (overhead projectors provided): 



UCLA FACULTY, STUDENTS & STAFF

Indicate if you are:

□ UCLA Faculty

□ UCLA Grad Student

□ CRESST Research Staff

Fees are waived for UCLA 
faculty and CRESST staff 



ON-SITE OR COMMUTER REGISTRANTS ONLY 

Registration options: On-site or Commuter. The $275 on-site conference registration fee includes 
Wednesday and Thursday night housing, parking if necessary and all meals for two days at the Sunset 
Village conference center. Extra nights ($95 per night) include all meals and housing. The $125 
commuter registration fee includes several meals and parking but no housing. If you need housing,
Sunset Village is strongly recommended. 

FEES: Checks payable to: 

Regents of UC 



□ On-site ($275) 

(Wednesday & Thursday night housing included) 

□ Commuter ($125) 

Extra nights at $95 per night: 

□ Tues □ Fri □ Sat □ Other (specify)



Registration Fee 
Extra Night/s 
Total 



$ 
CRESST: A Continuing Mission To Improve Educational Assessment



Many complaints about testing during the 1970s and 1980s emphasized
bias against minority, female, and disadvantaged students (e.g.,
Haney, 1981; National Commission on Testing and Public Policy, 1990)
and secrecy. More recently, the debate has highlighted the perceived
mismatch between tests designed to measure general achievement
without clear ties to specific curriculum or instructional
experiences and the need for assessments that are explicitly linked
to particular content standards or curriculum guidelines (Resnick &
Resnick, 1992). The latter approach represents a shift in the
formulation of the basic constructs to be measured by formal
assessments from general ability to learned accomplishments and a
desire to use the same assessment for different purposes.

This reformulation occurred for a number of reasons. First,
traditional forms of assessment were gradually demystified. The 1979
test disclosure legislation in New York (S. B. 5200-A and subsequent
amendments to Article 7-A of the New York Education Law) resulted in
the publication of previously secure admissions tests and allowed
leisurely perusal of items heretofore seen only in times of stress
by respondents. Acknowledgments were made that test preparation
could help performance (Bond, 1989; Messick & Jungeblut, 1981; Pike,
1978), especially when it led to generalized improvements in
relevant knowledge and skills, for example, understanding of
mathematics (Johnson & Wallace, 1989). Questions about norming
practices (the Lake Wobegon effect) raised by Cannell (1987) and
examined by technical experts (for example, Koretz, 1988; Linn,
Graue, & Sanders, 1990; Shepard, 1990) brought public discussion to
previously unchallenged procedures. Changing views of student
learning (see, e.g., Brown & Campione, 1994; Chi, Glaser, & Farr,
1988; Glaser, 1996; Greeno, 1995) suggested different sorts of tests
(Archibald & Newman, 1988; Baron, 1990; Frederiksen, 1984; Glaser &
Silver, 1994; Mislevy, 1994; Shavelson, Lang, & Lewin, 1994;
Stiggins, 1987; Wiggins, 1989) and led to the growing interest in
performance assessment.

In the winter of 1991, OERI awarded an R&D center with a new
assessment mission, one that focused on the design and validation of
these new types of performance assessments and on studies of the
impact of these assessments in practice. In partnership with teacher
organizations, the research community, the Council of Chief State
School Officers, numerous state leaders, district assessment
personnel, and an extended cadre of classroom teachers, CRESST
articulated its mission in a list of criteria for the validity of
new assessments (Linn, Baker, & Dunbar, 1991).




Almost simultaneously, a national movement began focusing on content
standards and the idea of connecting assessments deeply to clear
expectations (National Council on Education Standards and Testing,
1992; Smith & O’Day, 1991). Supporting legislation in state after
state and the activity of professional and scientific organizations,
such as the National Council of Teachers of Mathematics (1989a,
1989b) and the National Academy of Sciences (1993), created a sense
that U.S. assessment practices would undergo a significant change.
Compatible changes were also being sought for the assessment and
certification of accomplished teachers. Yet, just as this enterprise
gained momentum, reservations about these approaches also surfaced.
Performance assessments suffered a crisis of credibility that
continues today, a split that displays the larger gap between the
views of educational reformers and other segments of the public
(Johnson & Immerwahr, 1994).




Some critics of new assessments objected to the idea that standards
had a “national” rather than local inspiration (Bracey, 1995; Sizer,
1995). Contention about the content of some standards led to a
reconsideration of the wisdom of a national approach (Brimelow &
Spencer, 1995; Rich, 1995). The California experience is a case in
point, where opposition to a new performance-based test was based on
propriety of assessment content, perceived objectivity, and cost of
administration and scoring (e.g., Asimow, 1994). Opponents argued
that the test neglected fundamental skills and academic content and,
amid rumors that the assessment asked students to write about
personal experiences, that it invaded family privacy. CRESST
interviews with parents in those schools where the opposition was
highest suggest that lack of information and misunderstanding of the
assessment contributed as much to parental concerns as did the
content and new format of the test. In addition, analyses of
California’s and other new performance assessments showed
deficiencies in some technical properties (Cronbach, 1995; Select
Committee, 1994).

These objections, meritorious or otherwise, influenced the current
state of assessment system development, a strategy far more complex
than that envisioned in 1991, involving different formats of
measures to meet public expectations. Yet, response to the
credibility concern may be at the expense of the validity of system
information. Many systems are adopting strategies that emphasize the
local rather than national development of curriculum standards and
expectations (Higuchi, 1995), more cautious advances on new forms of
assessments, and a recommitment to standardized tests. Supported by
the Improving America’s Schools Act (1994), the use of multiple
measures to meet expectations of different constituencies and a
focus on the inclusion of all students in assessments are
characteristics of these assessments. The inclusion and
accommodation requirements signal a change in the definition of
fairness: from the protection of subgroups to the exposure of any
differences in their performance with the intent of stimulating
improved system efforts to alleviate the revealed inequities.




These decisions create enormous challenges with regard to the
formulation of approaches to study the quality of these information
systems and the fairness, utility, and societal impact of the
results they yield. We must recognize that any one element of one
system will be subject to rapid change stemming from public
perception, policy realignment, or technical quality concerns.
Pressures will mount in these new systems to combine, equate, or
solve methodologically messy, conceptual, and possibly intractable
conflicts. It is our intention to address these broad issues as much
as possible in the real-time operations of states, districts, and
schools, rather than in the cleaner, neater world of leisurely
reanalysis and occasional data collection. For it is only in these
settings that we will be forced to confront the reality of public
perception and technical quality.

The current social and policy context leads us to a mission
ultimately focused on the range of information in assessment
systems. Although the specifics will vary greatly, there are a few
enduring questions that apply to systems and to individual measures.
Is the information produced credible? Are the resulting inferences
supported? Does the assessment lead to desired actions? Is the
testing useful for the different purposes it is intended to serve?
Much of CRESST’s R&D will be guided by these broad questions.

Conceptual Model for CRESST Research: Contributing to Knowledge,
Educational Improvement, and Public Engagement

For assessment systems to benefit education, they must provide
accurate information, they must be conceived as precursors to
reflection and action, and they must address the multiple frames of
reference of their users: the public, teachers, administrators,
policy makers, and, most of all, students. In Figure 1 we lay out
the conceptual model underlying our new research program. It
emphasizes societal impact as our ultimate goal: We seek to produce
new knowledge and understanding about educational quality, to
contribute to the use of assessment systems for educational
improvement (both in policy and accountability uses and in teaching
and learning), and to encourage productive public engagement in
education. The model identifies four major domains: validity,
fairness, credibility, and utility. We assert that the utility and
ultimate impact of assessment systems depend on the validity,
fairness, and credibility of the information produced by the system.
All three characteristics of assessment are necessary. Assessments
of high technical quality are of little use unless their results are
credible to key audiences. Similarly, credible but invalid or unfair
results will falter in the long run, for they will produce
misleading interpretations or counterproductive, inequitable
actions. Therefore, validity, fairness, credibility, and utility
provide the conceptual framework for the upcoming CRESST research
and development program.




Figure 1. CRESST conceptual model.
Validity

Validity is the core technical concept in educational assessment
(see, for example, AERA, APA, & NCME, 1985; Baker, O’Neil, & Linn,
1993; Cronbach, 1971, 1980, 1988; Linn, 1994; Messick, 1989, 1994;
Shepard, 1993), and a comprehensive view of validity drives our
work. In his authoritative chapter on validity, Messick (1989)
defines validity as “an integrated, evaluative judgment of the
degree to which empirical evidence and theoretical rationales
support the adequacy and appropriateness of inferences and actions
based on test scores or other modes of assessment” (p. 13).




Our work particularly highlights issues of con- 
struct validity. In simple language, construct validity 
is concerned with the meaning of the measures. 
Writing, mathematics, and history achievement are 
examples of constructs; so, too, are personal at- 
tributes such as motivation or self-concept. We 
address construct validity by asking questions such 
as: Does this procedure, intended to measure prob- 
lem solving, actually reflect higher order abilities 
rather than recall? Does this set of questions on 
geographic concepts provide a sound basis for gen- 
eralizing about a class’s understanding of the domain 
of geography? How much do difficulties with the 
English language impair the performance of some 
pupils on a mathematics assessment? 

Construct validity for a certain purpose is undermined when the
assessment is too narrow (“construct underrepresentation”) and also
when it is too broad (“construct-irrelevant influences”). For
example, in a physics assessment, construct underrepresentation
occurs if the assessment tasks require responses to only a few
non-representative ideas rather than a proper sample of physics
topics or if the topics are inappropriately weighted.
Construct-irrelevant influences are defined in terms of ancillary
skills, that is, skills other than those that the assessment is
intended to measure, that influence performance (Haertel & Wiley,
1993; Wiley & Haertel, 1996). Understanding of the language used in
the assessment is perhaps the most common example of an ancillary
skill where the intent of the assessment is knowledge of a content
area. Other examples of ancillary skills include testwiseness,
differential familiarity with task formats, and personal
characteristics such as test anxiety or impulsivity.

Fairness 

Fairness is an essential aspect of the validity of any assessment or
accountability system. The ideal is that the quality of the
inferences drawn from information will be judged in terms of
appropriateness for all people, of all backgrounds and needs. For
this ideal to be approached, fairness must pertain to every aspect
of the assessment process, from assessment design through
administration to interpretation of results. Fairness becomes
increasingly important as the stakes attached to results are raised.
The general perception that the system is fair is also central to
the credibility of assessment results.




Although fairness in testing has universal appeal, differing
conceptions of fairness have been applied to the measurement process
and have different implications for both technical inferences and
action. One common view of fairness is objectivity: safeguards to
assure that no one gets special advantage, an essential component of
the underlying rationale for the standardization of test
administration and scoring (Cronbach & Suppes, 1969). A second
aspect of fairness, the avoidance of bias, has often been taken to
imply the avoidance of disadvantage, but more technically is the
goal “. . . to limit the differential validity of a given
interpretation” (Cole & Moss, 1989, p. 205). In this conception of
fairness, the construct to be measured is assumed to have the same
meaning for all subgroups.

A substantial amount of inquiry in the area of bias has focused on
identifying test content and features with negative impact for
subgroups (Cole & Nitko, 1981; Figueroa & Garcia, 1994), addressing
questions such as the following: Does the assessment put minority
students at a disadvantage because they are less familiar than their
majority group counterparts with content that is not essential to
the skills and understandings that the assessment is intended to
measure? (See, for example, Johnson, 1995.) Does the task context or
any other non-essential aspect give an unfair advantage to boys in
comparison to girls, or vice versa? (See Tittle, 1975.) Do skills
that are ancillary to the intent of the assessment, such as reading
ability for a mathematics problem-solving assessment, create an
unfair disadvantage for certain students, for example, students with
limited English proficiency? (See August, Hakuta, & Pompa, 1994;
Haertel & Linn, 1996; Haertel & Wiley, 1993; Wiley & Haertel, 1996.)
Does the assessment reflect situations or problems that are likely
to be culturally biased? (See Gordon, 1992; Winfield, 1995.)
Potential bias in the scoring of performance tasks must also be
considered (Baker & O’Neil, 1996).

A third conception of fairness has a more compensatory and active
flavor, focusing on the adaptations and accommodations in the
assessment process that would provide students with an opportunity
to display their competence. Regardless of conceptions, one point is
clear: Available analytic techniques for examining the fairness of
assessments, for example, linguistic complexity of tasks (e.g.,
Abedi, Lord, & Plummer, 1994), differential item functioning (e.g.,
Camilli & Shepard, 1994; Holland & Wainer, 1993), or sensitivity
reviews, neither provide a guarantee of fairness nor are sufficient
to support the overall validity judgment. Taking seriously the
concept of fairness may well lead to alternative pathways, including
subgoals and measures, to attain common standards or to achieve more
diverse goals for all students.

Credibility 

I have very simple questions. Are the schools getting better or worse? What can I do to help? I can’t get a
straight answer.

- Chair of State Legislative Education Committee, 1994

The third feature of our model, credibility, encompasses the
perceptions and values of the public in general and of participants
directly involved in education activities and decisions.
Unfortunately, at the present time many Americans do not feel well
served by the educational information they receive; they feel closed
out and uninformed (McDonnell, 1995). The diverse and shifting
positions on controversial educational issues have increased
suspicion about the validity, veracity, and propriety of centralized
sources of information. In addition to privacy concerns, a
significant segment of the public, often fed by the cynicism of many
in the media, believes that it is not being given a truthful picture
by arms of government at the district, state, or national levels
(Gunther, 1992).

Validity without credibility produces assessments that have no
lifespan and whose findings are contested, diminished, or dismissed.
Validity, objectivity, and fairness are elements that influence the
degrees of trust placed in findings by different constituencies.
Moreover, credibility depends on the quality of information, the way
results are communicated, and the purposes and uses to which results
are put. Although credibility has not typically been examined by
those who study assessment, it has been the focus of research
attempting to understand how the public uses information in forming
opinions about political issues (e.g., Zaller, 1992), how they judge
the trustworthiness of the media (e.g., Gaziano & McGrath, 1986;
Gunther, 1992; West, 1994), and how policy elites and the public
regard the veracity of policy analysis and of various social
indicators such as unemployment statistics (Bozeman, 1986; Innes,
1990; MacRae, 1985).

Utility 

How can I use this in my teaching?

- 10th-grade history teacher, 1994

The domain of utility addresses constraints and options for action
in the real world. We distinguish three main components: potential
utility, action, and impact.

• By potential utility we refer to the fit between 
the assessment, its design and intended pur- 
poses, and practical constraints for its use, in- 
cluding user perceptions. Potential utility de- 
pends on whether assessments are coherent, 
feasible, and cost sensitive, and whether pur- 
poses and results are clear and understandable to 
users. 

• By action we mean the degree to which and how 
assessments are actually translated into practices 
and policies: how the assessment is actually 
used. Action depends on professional develop- 
ment for users, supportive resources, and imple- 
mentation strategies. 



• By impact we mean the degree to which desired 
effects are produced and unintended negative 
effects are avoided. The model conceives impact 
as identifiable, shorter-term consequences for 
practice and policy, particularly the fairness ac- 
cruing to all parties. Broader, less direct impact 
of information includes effects on public en- 
gagement in and public views on educational 
matters and longer term changes in policy and 
educational systems. These broader, ultimate 
goals are conceived as societal impact in our 
model. 

The potential utility of new and traditional forms of assessment has
been widely debated (Herman, 1992; Resnick & Resnick, 1992; Shepard,
1995; Wiggins, 1989; Wolf, Bixby, Glenn, & Gardner, 1991), and
substantial research has identified the general challenges to moving
from potential to the reality of action and impact (Gearhart & Wolf,
in press; Herman & Klein, in press; Koretz, Stecher, Klein, &
McCaffrey, 1994; Stecher & Herman, in press).




The utility of an assessment or accountability system may vary for
different actors and audiences. Three prime audiences (teachers,
policy makers, and the general public) are emphasized in the
proposed research and development work of CRESST. Our arenas of
action are assessment systems for teaching and learning and
large-scale assessment systems for equity, policy, and public
understanding. Utility depends on validity, fairness, and
credibility but may be as much influenced by local circumstances,
unforeseen expectations of constituencies, and the personal
predilections of leadership.

The heart of our model of utility is the person, the human
dimension, not abstract methodology, a particular analytic
technique, or any preferred form or format of test. For it is, after
all, people who must make inferences that are accurate, fair, and
appropriate for particular purposes and students. People make the
judgments about whether they can trust, can understand, and will
value and use information. People take action or do nothing; their
choices of action fit the real limits of available knowledge, sense
of benefit, and understanding of costs. Creating better methods,
high-precision techniques, more inclusive assessments, and glossier,
high-tech reports of results is of little use unless people use
assessment results wisely to achieve worthy goals.

Core Problems Guiding the New CRESST 
Research Agenda: Assessment System Goals 
and Validity Agenda 

Given the sheer number of issues that arise in the validity,
fairness, credibility, utility, and ultimate impact of any
assessment system, we have chosen to highlight here four research
areas in our new program: system coherence, adaptations and
accommodations of assessments, the measurement of progress, and
reporting.

System Coherence and Multiple Measures 

Heavy demands are placed on assessments. They are expected to serve
a range of purposes, yet their ability to do so requires coherence
within and across elements of the educational system as well as
within the system’s assessments (Smith & Levin, 1996).

Aligning assessment and curriculum. Both past experience and common
sense indicate that assessments do more than simply provide
information when assessment results are made highly visible and used
to hold educators accountable. Assessments influence what gets
emphasized in the classroom and what falls by the wayside (Koretz,
1988; Madaus, 1988; Shepard, 1991; Smith, 1991). Indeed, it is the
recognition that assessments can influence instruction that
contributes to their appeal to policy makers as potential tools of
educational reform.

One of the lessons learned from accountability 
systems of the past is that system coherence is essen- 
tial. If the assessments that count for accountability 
purposes are out of alignment with desired class- 
room practice, they will reshape the enacted curricu- 
lum to mirror the accountability measures and, if 
inappropriate, distort teaching and learning. In con- 
trast, where clearly connected to important system 
goals, assessments may support desirable coherence. 
Not only must assessments be aligned with content 
and performance standards; the whole system — 
curriculum materials, teaching strategies, profes- 
sional development, incentives, sanctions, and ex- 
pectations of various levels — must be aligned. 






Connecting the information from multiple mea- 
sures. Another lesson learned from accountability 
assessments of the past is that it is difficult for a single 
assessment to serve multiple purposes well without a 
major redesign effort (Baker, Linn, Abedi, & Niemi,
1996). A given assessment may be well suited to 
meet some expectations but poorly adapted for 
others. A teacher, for example, needs specific infor- 
mation on an immediate basis to guide short-term 
instructional decisions. External assessments yield 
information that is both too general and too slow in 
coming to be useful for making day-to-day instructional
decisions. The informal assessment information that
teachers rely upon for those day-to-day and
moment-to-moment decisions, on the other hand,
even when compiled in a student portfolio (Gearhart 
& Herman, 1995), is too idiosyncratic to be useful 
for informing policy makers or the public objectively 
about overall student achievement. 






The various, often incompatible, demands placed 
on assessments have led to recommendations that 
assessments be tailored to specific uses. Although 
nominally sensible, such recommendations, in turn, 
have led to a proliferation of testing. This is evident 
from even a simple listing of the assessments in which 
a given student may be required to participate — 
teacher-made assessments, instructionally embedded
tests that accompany textbooks and instructional
materials, criterion-referenced tests required
by the district, a norm-referenced test used for
program evaluation, a criterion-referenced statewide
assessment, and even, perhaps, the Trial State Assessment
of the National Assessment of Educational
Progress. Such proliferation obviously increases the
overall assessment burden and, concomitantly, creates
a problem of integrating findings into sensible
inferences and actions. Unfortunately, proliferation 
does not necessarily solve the problem of matching 
assessments to use, since several of the assessments 
may still be expected to serve the same purpose, and 
other purposes may be inadequately served. More- 
over, multiple assessments can lead to real and appar- 
ent conflicts in both interpretation and use of results 
when the different assessments emphasize different 
content or types of skills. 

In any event, assessment and accountability systems
almost always involve multiple measures. At the
simplest level, these may simply be scores in different
content areas. More complicated systems may in- 
clude multiple types of assessment data such as 
teacher-scored student portfolios, results of cen- 
trally-scored performance assessment tasks, and stan- 
dardized tests. Although each measure may yield 
useful results when considered alone, an assessment 
system implies the combination of information in 
meaningful ways. Sometimes, actions are required 
by legislation or board policy based on cumulative 
information across all measures, for instance, the 
decision to designate a school for a school improve- 
ment program or the distribution of awards to 
schools (see, for example, Crone, Long, Franklin, & 
Halbrook, 1994; Improving America’s Schools Act 
of 1994; Kentucky Department of Education, 1995; 
Mandeville, 1988; Sanders & Horn, 1993). 






Systems clearly need to allow for the flexible 
inclusion of multiple measures, but procedures and 
strategies are needed to assure that the multiple 
measures contribute to, rather than undermine, co- 
herence. In particular, there are two potential prob- 
lems that need to be addressed in dealing with 
multiple measures: (a) Redundancy across measures 
in the underlying constructs assessed may introduce 
unintended weightings in composite or summary 
scores; and (b) Important distinctions may be lost 
when results of multiple measures are combined. 
Multivariate analyses are needed to disentangle the 
redundancy and expose the different aspects of per- 
formance that support the overall validity of system 
interpretations. An important focus of CRESST
research, system coherence also requires that the
information from multiple measures be combined in 
ways that are consistent with purposes and that 
provide information about status and progress. 
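
A small numerical sketch may help make problem (a) concrete. The three measures, their correlations, and the equal nominal weights below are invented for illustration and are not CRESST data. The sketch assumes standardized scores and shows only that when two measures overlap heavily, the construct they share can account for more of the composite than the nominal weights suggest.

```python
# Hypothetical illustration: three standardized measures, equal nominal weights.
# Measures A and B are assumed to overlap heavily (r = 0.9); C is more distinct.

def composite_variance_shares(weights, corr):
    """Return each measure's share of the composite score variance.

    A measure's share is its weight times its covariance with the
    weighted composite, divided by the composite variance.
    Scores are assumed to be standardized (variance 1).
    """
    n = len(weights)
    # Covariance of measure i with the weighted composite.
    cov_with_composite = [
        sum(weights[j] * corr[i][j] for j in range(n)) for i in range(n)
    ]
    total = sum(weights[i] * cov_with_composite[i] for i in range(n))
    return [weights[i] * cov_with_composite[i] / total for i in range(n)]


weights = [1 / 3, 1 / 3, 1 / 3]
corr = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.3],
    [0.3, 0.3, 1.0],
]
shares = composite_variance_shares(weights, corr)
for name, share in zip("ABC", shares):
    print(f"measure {name}: {share:.1%} of composite variance")
```

Under these assumed correlations, measure C carries one third of the nominal weight but contributes noticeably less than one third of the composite variance, which is the kind of unintended weighting that multivariate analysis is meant to expose.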

Adaptations and Accommodations 

Recent federal legislation (Improving America’s 
Schools Act, 1994; Individuals with Disabilities 
Education Act, 1990) presents states, districts and 
schools with new challenges in providing disabled 
students the least restrictive environment and in 
encouraging states and districts to include in their 
large-scale assessments students with Individualized 
Education Plans (IEPs) and language minority stu- 
dents who have traditionally been excluded. How do 
we accommodate the needs of students of varying 
abilities and disabilities within mainstream instruc- 
tional and assessment programs? How do we adapt 
assessments to the needs of language minority stu- 
dents whose achievement has heretofore been largely 
unexamined? How do we assure accurate placement 
of students with varying abilities and language capa- 
bilities? There is little research to date to guide policy 
and practice (August et al., 1994). Particularly per- 
plexing is the issue of assuring fairness to all stu- 
dents — both those who are designated for special 
services and accommodations and those who are not. 

Problems of exclusion. A common practice in the
past has been simply to excuse from participation
those students for whom the assessment is deemed
inappropriate. Yet such exclusions raise important
fairness issues and can distort overall assessment 
results (Haladyna, Nolen, & Haas, 1991), particu- 
larly when the rules for exclusion vary from one site 
to another or from one time to another at a given site. 
Moreover, without inclusion, the assessment system 
provides no information about a sometimes sizable 
proportion of the students participating in the education
system, which may reduce the likelihood that
these students receive the services they need to
achieve the content standards being assessed. 

Recent experience with NAEP provides some 
indication of the scope of the exclusion problem. In 
the 1992 NAEP administration, approximately 5% 
of the sampled students were excluded from the 
assessment because an IEP judged it inappropriate 
for the student to participate in the assessment. The 
State Trials showed that IEP exclusions ranged from 
2% to 8% (National Academy of Education, 1993), 
a range that likely reflects differences in state and 
local policies and practices rather than any differ- 
ences in special learning needs of students from state 
to state. On the other hand, variations in exclusion 
rates for language minority students, which ranged 
from less than 2% to 11% for Grade 4 reading, likely 
reflect differences in immigration patterns as well as 
differences in inclusion policies and practices. 






Though seldom discussed in public reports of 
results, exclusion rates on district-administered
standardized tests often are as high as or higher than those
on NAEP, and many state assessments also exclude 
a substantial number of students because of language 
minority or IEP status. Exclusions are generally 
motivated by concern that the assessment is inappro- 
priate for the excluded student, but in some high- 
stakes situations, exclusion from testing or retention 
may also come about as a way of inflating scores 
(Darling-Hammond, 1995; Zlatos, 1994). The pressure
is to exclude students from the assessment
and, perhaps, from meaningful instruction.

Issues in inclusion. Clearly assessments designed 
for large-scale, on-demand administrations may be
inappropriate for some students due to language
requirements or because the tasks and response 
demands are not suitable for the current instruc- 
tional levels of disabled students (Thurlow, Ysseldyke, 
& Silverstein, 1993). Special needs students who are
to be held to the same standards as other students 
may need accommodation in test format, for in- 
stance, large-print versions of the test, or in testing 
environment, for example, a test carrel (Amos, 
1980; Beattie, Grise, & Algozzine, 1983; Wildemuth,
1983). Other accommodations for students with 
disabilities may also include more breaks during 
testing, or extended testing time, perhaps over sev- 
eral days. Inclusion of students who have been 
excluded in the past clearly cannot achieve the goal
of fairness unless participation of those students is 
meaningful and leads to valid interpretations and 
actions (i.e., the assessment interpretations and uses 
have an acceptable degree of construct validity) 
(Sherman & Robinson, 1982). 






In spite of available or newly developed adapta- 
tions, there are some students for whom these test 
adaptations are still inappropriate. Alternative as- 
sessments are needed for these students (see Ken- 
tucky Portfolios for Special Education, Kentucky 
Department of Education, 1995). Although these
approaches are promising, there has been little or no
research investigating the validity of inferences
from these adaptations or alternatives.

Some states have made a strong commitment to 
the idea of including all or nearly all students by 
offering assessments in languages other than English
and by allowing for adaptations or accommodations
for students with diverse needs, for example, extra 
time, or oral administration (see Kentucky Depart- 
ment of Education, 1995), but have not examined 
the construct equivalence of these measures. Spanish 
language versions of assessments in content areas 
have been used in some states (e.g., California and 
Texas), simply ignoring the confounding issues of 
language of instruction, prior educational history, 
and cultural differences (Duran, 1992; Geisinger, 
1992; Valdes & Figueroa, 1994). Offering a test in 
both English and Spanish, furthermore, does not 
assess subject area competency of students not fully 
literate in either language. 






Parents and teachers of students with disabilities 
and those with limited English language proficiency 
want their children included in testing for purposes 
of accountability and to improve the education of 
their children. At the same time, they certainly do 
not want their children hurt, frustrated, or treated 
unfairly by inclusion. At issue, then, is how eligibility 
for accommodations and adaptations should be de- 
termined, what modifications should be permitted, 
and how scores obtained under nonstandard condi- 
tions should be reported. Defensible uses and inter- 
pretations of results based on adaptations and ac- 
commodations need to be articulated and justified. 
Inappropriate uses and interpretations also need to 
be identified. 

Choices about the basis for accommodation are 
influenced by empirical evidence concerning the 
equivalency of constructs measured in different languages
(LaCelle-Peterson & Rivera, 1994), background
knowledge (Cole & Scribner, 1973; Johnson,
1992), instructional opportunity (Baker & Rogosa, 
1995; Herman, Klein, Heath & Wakai, 1994; Pullin, 
1994; Winfield & Woodard, 1993), and motivation
(Ogbu, 1978). It may be unfair, for example, to use 
student performance on an assessment aligned with 
newly adopted content standards and curriculum to 
compare or make quality judgments about teachers 
who have had differential access to professional 
development activities designed to introduce the 
standards and curriculum. Similarly, it may be unfair 
to compare or judge students based on their perfor- 
mance on assessments that are consistent with the 
instruction provided to some students but out of 
alignment with that provided to other students. 
Thus, an analysis of the correspondence between 
what is taught and what is assessed will be an 
important aspect of the CRESST agenda. How 
validity studies can best deal directly with alignment
issues is yet to be determined. 






As was argued by the National Academy of Sci- 
ence panel on Placing Children in Special Education: 
A Strategy for Equity (Heller, Holtzman, & Messick,
1982), validation research is essential to provide 
both the evidential and consequential basis to sup- 
port specific adaptations and accommodations and 
interpretations of results that they yield. For this 
reason, the CRESST project focusing on adaptations 
and accommodations for language minority students
and the study for IEP students outline research
and development plans within a broad validation
framework, and classroom-based projects address 
the problem from the perspective of actual teaching 
and learning issues. 






Measuring Progress 

Learning involves change. Hence it is no surprise 
that the measurement of change is of fundamental 
interest to many assessment and accountability sys- 
tems. Parents, students, teachers, administrators, 
policy makers, and the public share interest in a 
simple question: Are [my] children making progress? 
Are they learning? Are schools getting better? New 
Title I regulations, furthermore, create a basic and
substantial need to measure change in terms of 
students’ “annual yearly progress.” 

Yet the measurement of change poses substantial 
challenges, including problems of low reliability of 
change scores, confounding changes in what is mea- 
sured with changes in student performance, and 
sensitivity of growth to the particular type of scale 
used to report assessment results. Other problems 
arise when the goal is the assessment of progress for 
groups (e.g., the identification of schools that are
making adequate progress), most notably the poten- 
tial confounding effect of changes in the student 
population due to year-to-year differences or due to 
mobility. 
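
The low reliability of change scores can be illustrated with the standard classical test theory expression for the reliability of a simple difference score. The sketch below assumes uncorrelated measurement errors on the two occasions; the reliabilities and the pretest-posttest correlation in the example are hypothetical values, not results from any CRESST study.

```python
def difference_score_reliability(rel_pre, rel_post, corr, sd_pre=1.0, sd_post=1.0):
    """Reliability of the difference (posttest minus pretest).

    Classical test theory formula, assuming measurement errors on the
    two occasions are uncorrelated:
        true-score variance of the difference / observed variance of the difference
    """
    covariance = corr * sd_pre * sd_post
    true_var = sd_pre**2 * rel_pre + sd_post**2 * rel_post - 2 * covariance
    observed_var = sd_pre**2 + sd_post**2 - 2 * covariance
    if observed_var <= 0:
        raise ValueError("observed difference scores have no variance")
    return true_var / observed_var


# Hypothetical case: two tests that are each quite reliable (0.90)
# and that correlate 0.80 across occasions.
reliability = difference_score_reliability(0.90, 0.90, corr=0.80)
print(f"reliability of the change score: {reliability:.2f}")
```

With these assumed values, two tests that each have a reliability of 0.90 yield a change score with a reliability of only about 0.50, because most of what the two occasions share is subtracted away.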

Sensitivity of measures of change to the scale of 
measurement is also a cause for concern because of 
the arbitrary nature of scales often used to report 
results of assessments. With standardized tests, for 
example, the pattern of growth in student achievement
appears quite different for scores based on
different scaling models (Linn, 1981; Linn & Slinde,
1977; Seltzer, Frank, & Bryk, 1994). Performance-
based assessments have not been studied as exten- 
sively with regard to this issue, but they are also 
subject to the problem that change results are sensi- 
tive to choice of scale. Regardless of form of assess- 
ment, the use of standards-based reporting proce- 
dures raises yet other complications, since the changes 
reported for individuals and for groups of students 
will be sensitive not only to gains in student achieve- 
ment, but to the number and stringency of standards 
used, as well as where on the scale the standard is set. 
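
The dependence of standards-based change on where the standard is set can be sketched under a simplifying assumption that scores are normally distributed. The means, standard deviation, and cut scores below are hypothetical; the sketch applies the same three-point gain in the group mean to two different cut scores.

```python
from statistics import NormalDist


def percent_meeting_standard(mean, sd, cut_score):
    """Percentage of students at or above the cut score (normal scores assumed)."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut_score))


def change_in_percent_meeting(mean_before, mean_after, sd, cut_score):
    """Change in the percentage meeting the standard for a given mean shift."""
    before = percent_meeting_standard(mean_before, sd, cut_score)
    after = percent_meeting_standard(mean_after, sd, cut_score)
    return after - before


# Same hypothetical gain (50 to 53, SD 10) reported against two standards.
for cut in (50.0, 70.0):
    change = change_in_percent_meeting(50.0, 53.0, sd=10.0, cut_score=cut)
    print(f"cut score {cut:.0f}: change of {change:.1f} percentage points")
```

In this sketch an identical gain in average achievement appears as a change of roughly twelve percentage points when the standard sits at the middle of the distribution and only about two points when it sits far into the upper tail.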

Assessment and accountability systems clearly 
need to be capable of reporting progress as well as 
status of schools and districts, including intermedi- 
ate benchmarks that can be used to gauge the 
adequacy of the progress. This principle implies the 
need to attend to several technical issues, such as the 
comparability of assessments from year to year and, 
in the case of schools or school systems, the compa- 
rability of different cohorts of students. Two of the 
more important issues that need to be dealt with in 
the proposed CRESST research are the development 
of adequate procedures for estimating the degree of 
uncertainty associated with measures of student and 
school progress, and effective communication of 
that information to audiences that will use measures 
of progress. 
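
One simple way to express the uncertainty in a school's year-to-year change is an approximate interval around the difference between two cohort means. The sketch below treats each cohort as an independent random sample with a common standard deviation, which is a strong simplifying assumption, and the school size, means, and standard deviation are hypothetical. It is offered as a starting illustration rather than as the procedure CRESST research will adopt.

```python
import math


def se_of_mean_difference(sd, n_first, n_second):
    """Standard error of the difference between two independent cohort means."""
    return sd * math.sqrt(1 / n_first + 1 / n_second)


def gain_interval(mean_first, mean_second, sd, n_first, n_second, z=1.96):
    """Approximate 95% interval for the change in cohort means."""
    gain = mean_second - mean_first
    margin = z * se_of_mean_difference(sd, n_first, n_second)
    return gain - margin, gain + margin


# Hypothetical school: 60 fourth graders each year, SD of 10 scale points,
# and an observed gain of 3 points in the grade-level mean.
low, high = gain_interval(50.0, 53.0, sd=10.0, n_first=60, n_second=60)
print(f"observed gain 3.0, approximate interval {low:.1f} to {high:.1f}")
```

For a school of this assumed size the interval includes zero, so an observed three-point gain could not be distinguished from ordinary cohort-to-cohort variation without further evidence.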

Fortunately, there have been substantial improve- 
ments in the analytical approaches now available for 
tackling the problems associated with measuring 
progress. New analytic models and perspectives on 
the measurement of change (e.g., Bryk & 
Raudenbush, 1987, 1992; Muthen, Huang, Jo, 
Khoo, Nelson Goff, Novak, & Shih, 1995; Muthen, 
Khoo, & Goff, 1994; Rogosa, Brandt, & Zimowski, 
1982; Rogosa & Saner, 1995a, 1995b; Rogosa & 
Willett, 1985) provide a firmer theoretical foundation
for attacking the problems associated with the
demand to measure student progress and the progress
of educational systems. 






Analytical models described by Bryk and 
Raudenbush, Rogosa, and by Muthen and his col- 
leagues will serve as the starting point for CRESST 
research and development work on progress mea- 
surement. While value-added conceptions provide a 
useful framework for addressing many of the goals 
implicit in the demand to report the progress of 
schools or other aggregations of students, substan- 
tial research and development is needed to under- 
stand how best to deal with student mobility and to 
understand the implications and trade-offs of models 
that rely on year-to-year comparisons of different 
cohorts of students enrolled in a given grade (for 
example, Grade 4 students in 1996-97 compared to 
Grade 4 students in 1997-98) as compared to ap- 
proaches that rely on longitudinal samples following 
the same students across years. Similarly, research is
needed to provide a basis for understanding the
implications of using different summaries of student
performance, such as group means or percentage of 
students meeting a standard, for measuring progress. 
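
The effect of student mobility on these two approaches can be seen in a toy example. The student identifiers and scores below are invented: one student leaves the school after the first year and a lower-scoring student arrives in the second. The sketch compares the change in the mean of all enrolled students with the mean gain of the students present in both years.

```python
from statistics import mean

# Hypothetical scores keyed by student; "s4" leaves and "s5" arrives.
year_one = {"s1": 40, "s2": 50, "s3": 60, "s4": 70}
year_two = {"s1": 46, "s2": 55, "s3": 66, "s5": 30}


def enrolled_mean_change(first, second):
    """Change in the mean of whoever is enrolled each year."""
    return mean(second.values()) - mean(first.values())


def matched_mean_gain(first, second):
    """Mean gain for students tested in both years (longitudinal sample)."""
    stayers = sorted(set(first) & set(second))
    return mean(second[s] - first[s] for s in stayers)


all_change = enrolled_mean_change(year_one, year_two)
matched_gain = matched_mean_gain(year_one, year_two)
print(f"change in enrolled mean: {all_change:+.2f}")
print(f"mean gain of stayers:    {matched_gain:+.2f}")
```

Here the enrolled mean falls while every continuing student improves, so the two summaries point in opposite directions. Neither is wrong, but they answer different questions, which is part of what the proposed research needs to sort out.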

Reporting for Understanding and Action 

Without effective communication and reporting, 
the utility of an assessment or any assessment system 
is severely compromised: Results languish unused, 
the potential of substantial investments wasted; or 
worse, results can be misused. The proliferation of 
assessments nationally and locally and the addition of 
new forms of assessment only heighten historic
problems in teachers’, students’, and public understanding
of test results and what to do with them
(Hambleton & Slater, 1995; Herman & Dorr- 
Bremme, 1983; Stiggins, 1991). Recent media re- 
ports of the public response to new forms of assess- 
ment further underscore inherent problems of un- 
derstanding and communication (Merl, 1994; Sand- 
ers, 1995). People — students, parents, teachers, ad- 
ministrators, policy makers, the public — cannot act 
sensibly on information they either do not have or do 
not adequately understand. The issues thus are ones 
of access and distribution as well as of the clarity and 
usability of information. 






Just as a single test score cannot serve all purposes, 
a single report cannot meet the needs of all users. As 
most students of writing know, the writer is supposed
to develop a model of what the audience expects in 
level of information, tone, and structure. Successful 
writers create good matches with their audiences. 
Some excellent writers can adapt their work for a 
wide range of audiences differing in expectation, 
knowledge, language, tolerance for detail, desire for 
entertainment, and available time to devote to the 
enterprise. Similarly, research shows that users want 
reports tailored to their needs and decision arenas, 
with direct implications for action (Herman, 1989; 
Hood et al., 1972). Furthermore, research on multiple
intelligences (Gardner, 1993) and other aspects
of cognition suggests that multiple modalities of 
communication are essential to meet diverse cogni- 
tive styles (Snow & Lohman, 1989). 

Technology provides new possibilities for displaying
and customizing information and new avenues
for distribution (Baker, in press). New iconic
representations are possible to help guide users’
attention and understandings. Desktop publishing 
and automated authoring and editing systems will 
greatly ease the burden of adapting user-friendly 
reports for different constituents, and automated 
analysis routines will enable different levels of data 
aggregation for different reports. While technology 
can ease the production problem, the specifics of 
what different audiences want to know, what they 
will find credible, and how best to communicate also 
demand attention and remain prerequisite issues 
that will be addressed by our research programs. In 
collaboration with relevant constituencies, we will 
seek to understand how to combine and communi- 
cate complex information from assessment systems 
in ways that are fair, valid, and credible for different 
audiences and to serve different purposes. 






The power of an interactive communication pro- 
cess to promote information use also is well estab- 
lished (Patton, 1988). In this area, too, technology 
dramatically opens up channels for users’ interaction 
and analysis. As the Internet demonstrates, a major 
shift in information access is underway; distributed 
use of information, tools, and systems is now a 
reality. Wider distribution to new users will require 
clear frameworks (Baker & O’Neil, 1994) for inter- 
preting results. Users of information will want to 
know how it relates to other findings. Parents and 
teachers will become interested in how they can 
replicate what is assessed (create local versions of 
standard measures), use new approaches to help 
their own children succeed, and discuss and improve
the educational process. Our research program will
help to identify the requirements of credible and 
useful information systems for these various users 
and to build tools to support their use of assessment. 

Addressing an Ambitious Agenda 
CRESST has established an ambitious agenda for 
the next five years. The specific work we have pro- 
posed is guided by a shared set of beliefs about the 
nature of effective R&D in our mission area: 

• Assessment, evaluation, and accountability represent
only a small part of what is truly important
about educating our children.

• R&D must commit to improving educational 
quality. 

• Useful R&D focuses on real problems and uses 
theoretical paths to explore their solution. 

• Collaboration is essential in identifying and
understanding problems and in determining the
value of options.

• R&D findings should be aggressively commu- 
nicated in accessible and compelling ways to all 
audiences — policy makers, politicians, teachers, 
parents, and students. Don’t wait for them to 
ask. 

• Diverse perspectives are needed to clarify real 
differences and to find equitable, workable bal- 
ances. 

• Impartiality, not advocacy, is the key to the 
credibility of research and development. 

• The best R&D meets current needs but seeks to 
redefine constraints so that creative solutions 
are possible. 

We address our mission through four highly 
interrelated programs: 

• Program One: Assessment in Action addresses
fundamental problems in improving the utility
of assessment at the school and classroom levels
and, in the process, will investigate substantive
issues of validity, fairness, and credibility that are
essential in assessment systems serving large-scale
accountability and educational improvement
purposes.

• Program Two: Accountability, Equity, Policy, and
Public Engagement combines active, scholarly
reflection on the purposes, implementation,
and effects of large-scale assessment systems
with action-oriented, practical responses to current
assessment design and interpretation problems.

• Program Three: Technical and Functional Quality
of Assessment Systems: Validity, Equity, and
Utility investigates the technical uses and interpretations
that are made from assessment results
and the changes produced in the school, com- 
munity, and student body; in curriculum and 
instruction and policies for student advance- 
ment and placement; in policies and attitudes of 
the district or community; and in the natural 
perception of education. 

• Program Four: Outreach and Dissemination will
access the continuous feedback necessary for the
improvement of CRESST R&D and greatly
reduce the cycle of time between assessment
research and its application to practice.

Our programs in teaching and learning and in 
accountability, equity, and policy reflect the arenas 
of action for our research. We consider them points 
of entry that over time will enable us to design 
assessment systems that can serve both arenas. Our 
program organization mirrors our belief that the 
improvement of assessment policy and practice re- 
quires sophisticated understanding of socio-politi- 
cal, functional, and technical problems of assessment 
systems as they exist from the top down (Program 
Two) as well as the intricacies of how assessment is
used and perceived from the bottom up, at the
school and classroom learning levels (Program One).
Merging these two perspectives with the theoretical 
advances in fairness and validity (Program Three), 
we believe that over the next five years we can make 
an important contribution to the design and analysis 
of coherent assessment systems that serve educational
quality.

• Assessment and Instruction in Elementary Mathematics:
What We’ve Learned — Maryl Gearhart,
CRESST/UCLA; Megan Franke, CRESST/

• Model-Based Large-Scale Assessment — David
Niemi, University of Missouri; Zenaida Munoz, 
CRESST/UCLA 

• Linking Language Arts Standards to Assessments 
— Charlotte Higuchi, CRESST/Los Angeles Unified
School District/United Teachers, Los Angeles

4:00-5:15 p.m. — High Technology Applications
for the Assessment of Student Knowledge and 
Learning 

• Lessons from CRESST Technology Research 
Programs — Harry F. O’Neil, Jr., CRESST/ 
University of Southern California 

• Model-Based Computer Assessment of Problem- 
Solving Skills in Science — Ron Stevens, CRESST/ 
UCLA 

• Recent Developments in Computer Assessment 
at the Educational Testing Service — Randy 
Bennett, Educational Testing Service

• Discussants — Eva L. Baker, CRESST/UCLA

Friday, September 6, 1996 

8:30-10:15 a.m. — Helping All Students to Arrive
Part I: Adaptations for Students With Disabili-
• Validity Issues in the Assessment of Students
tion (to be determined)
modations for Language Minority Students —

New Directions in Statewide Assessment

• Duncan MacQuarrie, Washington Department 
of Education 

• Doris Redfield, Department of Education, Commonwealth
of Virginia

• Wayne Martin, Colorado State Department of
Education

• Brian Stecher, CRESST/The RAND Corporation

Building Teacher Capacity for Improved Classroom
Assessment

• Hilda Borko, CRESST/University of Colorado
at Boulder

• Lynn Winters, Long Beach Unified School District

Special Issues in the Assessment of At-Risk Students
in Large Urban Schools

• Sidney Thompson, Los Angeles Unified School
District

• Carole Perlman, Chicago Public Schools

• Ruben Carriedo, San Diego School District

Other invited forum participants include: 

• David Stevenson, U.S. Department of Education 

• Adrienne Bailey, Council of the Great City Schools 

• Carl Cohn, Long Beach Unified School District 
(invited) 

• Theresa Dozier, U.S. Department of Education
(invited) 

3:00-4:15 p.m. Moving Ahead 

• From Learning Theory to Assessment Practice — 
Lauren Resnick, CRESST/University of Pitts- 
burgh (invited) 

• From Vision to Capacity: Building Teacher Un- 
derstandings in Standards and Assessments — 
Marilyn Monahan, National Education Associa- 
tion (invited) 

• From Disjunct to Convergence: Moving Toward 
Reality in Policy and Practice — Pascal Forgione, 
National Center for Education Statistics 

• The Road Ahead — Lee Shulman, Stanford University
(invited)

4:15-4:45 p.m. Wrapping It All Up 

• Eva L. Baker, CRESST/UCLA 

• Robert L. Linn, CRESST/University of Colorado
at Boulder

Assessing the Validity of the National Assessment
of Educational Progress: NAEP Technical
Review Panel White Paper

Robert L. Linn, Daniel Koretz, and Eva L. Baker
CSE Technical Report 416, 1996 ($5.00) 

Under a contract from the National Center for 
Education Statistics, the CRESST Technical Review 
Panel has conducted a series of research studies 
addressing the uses and interpretations of the Na- 
tional Assessment of Educational Progress (NAEP), 
oftentimes known as the nation’s report card. This
report summarizes the most important findings, including
the quality of NAEP data, the number and
character of NAEP scales, the robustness of NAEP 
trend lines, the trustworthiness of and interpretation 
of group comparisons, the validity of interpretations 
of NAEP anchor points and achievement levels, the 
effects of student motivation on performance, the 
adequacy of NAEP data on student background and 
instructional experiences, and what is understood 
from NAEP reports by educators and policy makers. 

Performance Puzzles: Issues in Measuring Capa- 
bilities and Certifying Accomplishments 

Lauren Resnick 

CSE Technical Report 415, 1996 ($5.50) 

In this report, CRESST/University of Pittsburgh 
researcher Lauren Resnick explores major issues in 
using assessments as a means of defining standards 
and encouraging efforts to meet them. She discusses 
the differences between the purposes for traditional 
and newer types of assessment, issues of scoring 
reliability, generalizability of observed performance, 
and content and construct validity involving perfor- 
mance assessment and portfolios. 



Evidence and Inference in Educational Assess- 
ment 

Robert Mislevy 

CSE Technical Report 414, 1996 ($5.50) 
“Data” from educational assessments become “evi- 
dence” only with respect to conjectures about students 
and their work, says Robert Mislevy in this report 
based on his 1994 presidential address to the Psycho- 
metric Society. Those conjectures are constructed 
around notions of the character and acquisition of 
knowledge and skill, and shaped by the purpose of 
the assessment and the nature of the inference re- 
quired. Using a detailed analytic framework, the 
author demonstrates how the concepts and tools of 
mathematical probability can help explain relation- 
ships between evidence and inference about students’ 
knowledge, learning, and accomplishments. 

The Role of Probability-Based Inference in an 
Intelligent Tutoring System 

Robert Mislevy and Drew Gitomer 
CSE Technical Report 413, 1996 ($5.50) 
Probability-based inference in complex networks 
of interdependent variables is an active topic in 
statistical research, spurred by such diverse applica- 
tions as forecasting, troubleshooting, and medical 
diagnosis. Based on an instructional tutoring system 
for learning to troubleshoot a military F-15 aircraft 
hydraulics system, the authors in this study explore 
the role of Bayesian inference networks for updating 
student models in intelligent tutoring systems (ITSs). 

Latent Variable Modeling of Longitudinal and 
Multilevel Data 

Bengt Muthen 

CSE Technical Report 412, 1996 ($3.50) 

This report gives an overview of some aspects of 
latent variable modeling in the context of growth 
and clustered data. The author emphasizes the 
benefits that can be gained from multilevel as op- 
posed to conventional modeling techniques that 
ignore the multilevel data structure. Large-scale 
educational surveys are used to illustrate key points. 

A Simple Approach to Inference in Covariance 
Structure Modeling With Missing Data: Baye- 
sian Analysis 

Bengt Muthen 

CSE Technical Report 411, 1996 ($2.50) 

In this report, CRESST/UCLA researcher Bengt 
Muthen investigates an improved approach for edu- 
cational analyses where there are significant amounts 
of missing data. The author found that a Bayesian 
approach, developed by himself and Gerhard 
Arminger, offers a promising technique for missing 
data covariance structure modeling. The technique 
should soon be available in covariance structure 
software. 

Issues in Portfolio Assessment: The Scorability 
of Narrative Collections 

John R. Novak, Joan L. Herman, and Maryl Gear- 
hart 

CSE Technical Report 410, 1996 ($4.50) 

This report provides a model for examining tech- 
nical questions concerning the validity and reliability 
of large-scale portfolio assessment scores. One of the 
key findings was that the holistic scale of the CRESST 
“Writing What You Read” narrative rubric — a rubric 
designed to enhance teachers’ understandings of 
narrative and to inform instruction — could be used 
reliably and meaningfully in large-scale assessment of 
narrative collections. 



Final Report: Perceived Effects of the Maryland 
School Performance Assessment Program 

Daniel Koretz, Karen Mitchell, Sheila Barron, and 
Sarah Keith 

CSE Technical Report 409, 1996 ($5.50) 

In this study, CRESST/RAND researchers inves- 
tigated the effects of the Maryland School 
Performance Assessment Program (MSPAP) by sur- 
veying Maryland teachers and principals. General 
support for MSPAP as an instrument of reform (in 
contrast to its role as an assessment) was widespread 
among surveyed educators, but teachers’ views of 
MSPAP as an assessment were mixed. Large majori- 
ties of both teachers and principals reported that 
MSPAP has been at least somewhat successful in 
meeting its goal of improving instruction. 

Teachers reported relying on diverse methods to 
prepare students for MSPAP, ranging from broad 
improvements in instruction to narrowly focused 
test preparation, such as use of practice tests. Their 
explanations of MSPAP score gains in their own 
schools, however, raise the possibility that initial 
gains were inflated. About half of the surveyed 
teachers reported that work with practice tests and 
familiarity with the assessment had contributed a 
great deal to their gains, while only 15% to 20% said 
the same of improvements in knowledge and skills. 
The report recommends several lines of research to 
explore issues raised by these survey findings. 


Many CSE/CRESST Reports 
may be downloaded from the 
CRESST Web Site at 
www.cse.ucla.edu. 

Recent CSE/CRESST Technical Reports 



Estimating the Costs of Student Assessment in 
North Carolina and Kentucky: A State-Level 
Analysis 

Lawrence O. Picus, Alisha Tralli, and Suzanne 
Tacheny 

CSE Technical Report 408, 1995 ($4.00) 

Opportunity-to-Learn Effects on Achievement: 
Analytical Aspects 

Bengt Muthen, Li-Chiao Huang, Siek-Toon Khoo, Gin- 
ger Nelson Goff, John Novak, and Jeff Shih 
CSE Technical Report 407, 1995 ($2.50) 

Teachers’ and Students’ Roles in Large-Scale 
Portfolio Assessment: Providing Evidence of 
Competency With the Purposes and Processes of 
Writing 

Maryl Gearhart and Shelby Wolf 

CSE Technical Report 406, 1995 ($4.00) 

Patterns of Performance Across Different Types 
of Items Measuring Knowledge 
of Ohm’s Law 

Brenda Sugrue, Rosa Valdes, Jonah Schlackman, and 
Noreen Webb 

CSE Technical Report 405, 1995 ($2.50) 

Using Group Collaboration as a Window Into 
Students’ Cognitive Processes 
Noreen Webb, Kariane Nemer, Alexander Chizhik, 
and Brenda Sugrue 

CSE Technical Report 404, 1995 ($2.50) 

Instructional Influences on Content Area Expla- 
nations and Representational Knowledge: 
Evidence for the Construct Validity of Measures 
of Principled Understanding — Mathematics 

David Niemi 

CSE Technical Report 403, 1995 ($8.00) 



Monitoring and Improving a Portfolio Assess- 
ment System 

Carol Myford and Robert Mislevy 
CSE Technical Report 402, 1995 ($4.50) 

Comparing Reliability Indices Obtained by Dif- 
ferent Approaches for Performance Assessments 
Jamal Abedi, Eva Baker, and Howard Herl 
CSE Technical Report 401, 1995 ($2.50) 

Portfolio Driven Reform: Vermont Teachers’ 
Understanding of Mathematical Problem Solv- 
ing and Related Changes in Classroom Practice 

Brian Stecher and Karen Mitchell 
CSE Technical Report 400, 1995 ($5.00) 

Measurement of Teamwork Processes Using 
Computer Simulation 

Harold F. O’Neil, Jr., Gregory K. Chung, and Rich- 
ard S. Brown 

CSE Technical Report 399, 1995 ($5.00) 

Cognitive Analysis of a Science Performance 
Assessment 

Gail Baxter, Anastasia Elder, and Robert Glaser 
CSE Technical Report 398, 1995 ($5.00) 

Contact Kim Hurst at 310-206-1532 or 
“kim@cse.ucla.edu” for a current CRESST Product 
Catalog with additional listings. 